Board Game Geek Rating Prediction Based on Comments.¶

Author: Md Mintu Miah, ID:1001405116¶

Executive Summary¶

Text analysis is becoming important for Revealing imformation from text content. Due to advancement of machine learning algorithm, now the text analysis is very flexible and interesting in Neuro-lingistic field. The purpose of this project was predicting the game rating based on comments left by the users. For the purpose of the analysis, 30,000 unbalanced and 20,000 balanced random sample was taken from orginal board game geek data. To obtain the best accuracy MNB, MNB with N-grams, linear-SVC, Ensemble model for balanced SVC and Unbalanced SVC under joining condition and VotingClassifier condition was conducted. The accuracy for different experiment in case of unbalaced data set was: MNB accuaracy 27% with hyperparameter 1, MNB with N-grams 27% and SVC accuracy was 29%. But there was problem with these models due to unbalaced data used for training the model, it was not able to capture most of the negative rating (<5) and very high positive rating (>8). To overcome this problem a 20,000 balanced sample was created by taking 2000 samples from each rating. Then the SVC model was trained with balanced sample, after that SVC model was used to predict on unbalanced dataset. It was found that balanced training model accuracy was less than unbalced models, but it was able to capture all kinds of rating. SVC Balanced accuracy was 23% and SVC-unbalanced test accuracy was 20%, here we train SVC model only as we got that SVC performs better than MNB and MNB-N-grams in terms of the accuracy of the prediction. This project conducted two types of ensemable models: Voting Ensemable shows accuacy 29% while Joining Ensemble shows 66% accuracy which was outstanding results for this project. SVC balnced and SVC unbalced model was joined in this best model to predict on unbalced data.This study concludes that Ensemble model performs best than any other models with highest accuarcy. The challanging of this project was selecting the sample size and finding the best machine leaning algorithm and impliment them in a proper way.

Introduction¶

The Board Game Geek (BGG) database is a collection of data and information on traditional board games. The game information was recorded to intend for posterity, historical research, and user-contributed ratings. All the information within the database was meticulously and voluntarily entered on a game-by-game basis by board game user. This information is freely offered through flexible queries and "data mining". BoardGameGeek's ranking charts are ordered using the BGG Rating, which is based on the Average Rating. Game Rating was scaled 1 to 10 to present the sentiment. Understanding the popularity of the game depends on information provided by users, which is very important. For this project, board game reviews was used to predict the rating of the game using Machine learning Algorithm. Three kinds of Machine learning Algorithms (MNB, MNB_N-grams,SVC and Ensemble models) was used here for the whole projects.

Data Description¶

The orginal Board game data is vast (1GB-1,31,70,073 rows data for review file) and time consuming process to clean, that needs excellent memory of the computer. So to avoid complexity this project has taken 30,000 unbalnced and 20,000 balanced data set for the aanalysis and model development. The file contains

GameID
Rating
comment

Purpose of the Project¶

The main purpose of the project is predicting the rating of the game based on the given reviews and understanding how the test classification machine learning algorithm works and improving the outputs on existing references. The second purpose is providing good documentation for the whole process

Naive Bayes Classifier¶

The Naive Bayes classifier is a simple probabilistic classifier which is based on Bayes theorem with strong and naïve independence assumptions. It is one of the most basic text classification techniques with various applications in email spam detection, personal email sorting, document categorization, sexually explicit content detection, language detection and sentiment detection. Despite the naïve design and oversimplified assumptions that this technique uses, Naive Bayes performs well in many complex real-world problems.

Multinomial Naive Bayes¶

Multinominal Naive Bayes (MNB) algorithm has been widely used in text classification due to its computational advantage and simplicity. MNB maximizes likelihood rather than conditional likelihood or accuracy.The task of text classification can be approached from a Bayesian learning perspective, which assumes that the word distributions in documents are generated by a specific parametric model, and the parameters can be estimated from the training data. Beolow Equation shows Multinominal Naive Bayes (MNB) model which is one such parametric model commonly used in text classification where fi is the number of occurrences of a word wi in a document d, P(wijc) is the conditional probability that a word wi may happen in a document d given the class value c, and n is the number of unique words appearing in the document d.Conditional probability P(wijc) can be determined using the relative frequency of the word wi in documents belonging to class c. where fic is the number of times that a word wi appears in all documents with the class label c, and fc is the total number of words in documents with class label c in T.

One advantage of the Multinominal Naive Bayes model is that it can make predictions efficiently.

Multinomial Naive Bayes with N-grams¶

An n-gram is defined either as a textual sequence of length n, or similarly, as a sequence of n adjacent ‘textual units’, in both cases extracted from a particular document. A ‘textual unit’ can be identified at a byte, character or word level depending on the context of interest. N-Grams are the basic method for text categorization. It is also a statistical based approach for classifying text. The N is the number of keywords used for dividing the input text. Based on the number of keywords used, the N-gramsare called as 2-grams, 3-grams, etc.

Support Vector Machine- Linear SVC¶

Linear SVM is the newest extremely fast machine learning algorithm for solving multiclass classification problems from ultra large data sets that implements an original proprietary version of a cutting plane algorithm for designing a linear support vector machine.The objective of a Linear SVC (Support Vector Classifier) is to fit to the data, returning a "best fit" hyperplane that divides, or categorizes the training data.After getting the hyperplane, then the model can be feeded with test sample to classify the "predicted" class.

SVM uses kernel function, which finds the linear hyperplane that separates classes with the maximum margin. The above diagram shows how the data points (that is, support vectors) belonging to two different classes (red versus blue) are separated using the decision boundary based on the maximum margin.

Ensemble Model¶

Ensemble modeling is a process where multiple diverse models are created to predict an outcome, either by using many different modeling algorithms or using different training data sets. The ensemble model then aggregates the prediction of each base model and results in once final prediction for the unseen data. The motivation for using ensemble models is to reduce the generalization error of the prediction.Every model has its strengths and weaknesses. Ensemble models can be beneficial by combining individual models to help hide the weaknesses of an individual model.

Voting classification techniqes in the ensemable predicts based on major votes. For example, if we use three models and they predict [1,0,1]target variable, the final prediction that the ensemble model would make would be 1, since two out of the three models predicted 1.

Analysis Steps or Methods¶

Unbalanced Sample Analysis with MNB, MNB-N-Grams and SVC -¶

Board Game Geek Data Exploration
Cleaning the data
Taking 30,000 unbalanced random Sample
Understanding the unbalanced sample
Top 50 negative and Positive word identification and word cloud for them
Multinomial Naive Bayes,Multinomial Naive Bayes with N-grams(1,2) and Linear SVC model for unbalanced Sample

Balanced Sample Analysis with Best Model SVC-¶

Forming 10,000 random Balanced Sample from orginal data with 2000 samples from each rating (1-10)
Traing the best performed SVC model with Balanced Data set
Apply the balanced trained SVC model to predict unbalanced test data

Ensemble model Development with Balanced and unbalanced SVC model¶

Ensemble model with voting classifier
Ensemble model with joining Balced and Unbalced SVC model

Now lets start with our Board Game Geek Data¶

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/boardgamegeek-reviews/bgg-13m-reviews.csv
/kaggle/input/boardgamegeek-reviews/games_detailed_info.csv
/kaggle/input/boardgamegeek-reviews/2019-05-02.csv
/kaggle/input/bgg-comments/boardgame-comments-sample.csv
/kaggle/input/board-game-greck-reviews/test_predictions.csv
/kaggle/input/board-game-greck-reviews/reviews_sampled.csv

Import Library¶

import pandas as pd
import numpy as np
import string
import nltk
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn import svm, linear_model
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.ensemble import VotingClassifier 
sns.set(color_codes=True)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_score, train_test_split
import random
from sklearn.metrics import accuracy_score
from collections import Counter
from sklearn.metrics import accuracy_score

Lets start with importing our orinal data source file¶

review_data0 = pd.read_csv('../input/boardgamegeek-reviews/bgg-13m-reviews.csv', index_col=0)
review_data0.head()

/opt/conda/lib/python3.7/site-packages/numpy/lib/arraysetops.py:569: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)

** we will use .shape to see the number of rows and columns in our data file

review_data0.shape

(13170073, 5)

Orginal data file has 13,170,073 rows without cleaning and 5 columns. We will Remove all NaN rows from comment columns¶

So, Remove all NaN rows from comment column

review_data2=review_data0[~review_data0.comment.str.contains("NaN",na=True)]
review_data2.head()

We removed all missing comments row, now lets see the shape of the file¶

review_data2.shape

(2637755, 5)

Stll our current data table has 26,37,755 rows and 5 columns which is huge. Before taking sample, lets check the description of data and rating bar graph to get the idea of rating frequency in whole data¶

review_data2.describe()

#plot histogram of ratings
num_bins = 70
n, bins, patches = plt.hist(review_data2.rating, num_bins, facecolor='green', alpha=0.9)

#plt.xticks(range(9000))
plt.title('Histogram of Ratings')
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.show()

From histogram, it is clear that the highest rating frquency is 7 and than 8,6. The rating is included decimal like 4.2,4.3, 4.4,4.5,5.5,6.5,7.5 and others. But for our analysis we will take the rounded intiger value of rating.¶

It is clear that orginal data size is too big (13170073 rows) and needs lot of memory to calculate, so we will take subsample data to develop the model. I have taken 30,000 data for developing the model.¶

Now lets take 30,000 samples (Unbalanced)¶

review_data2.head()
review_data3=review_data2.sample(n=30000)
review_data3.head()

We need only rating and comment columns, but we will keep all as it will not hamper our preocess of analysis

Lets check data type with dtypes command

review_data3.dtypes

user        object
rating     float64
comment     object
ID           int64
name        object
dtype: object

our current rating data is float type and comment is object or combination of text, links, numbers and so on.

Lets check if we have any missing data yet

review_data3.isna().sum()

user       0
rating     0
comment    0
ID         0
name       0
dtype: int64

we do not have any missing data as we already removed all missing data very begining of the data exploration

Plot histogram of word count¶

review_data3['word_count']  = review_data3.comment.str.len()

num_bins = 70
n, bins, patches = plt.hist(review_data3.word_count, num_bins, facecolor='green', alpha=0.9)

#plt.xticks(range(9000))
plt.title('Histogram of Word Count')
plt.xlabel('Word Count')
plt.ylabel('Count')
plt.show()

The above histrogram shows us the frequency of word in our data set.

Making lowercase, removing punctuation and stop words¶

#lowercase and remove punctuation
review_data3['cleaned'] = review_data3['comment'].str.lower().apply(lambda x:''.join([i for i in x if i not in string.punctuation]))

# stopword list to use
stopwords_list = stopwords.words('english')
stopwords_list.extend(('game','play','played','players','player','people','really','board','games','one','plays','cards','would')) 

stopwords_list[-10:]

#remove stopwords
review_data3['cleaned'] = review_data3['cleaned'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords_list)]))
review_data3.head()

We have made lower case of all words in comment, removed punctuation and stop words to get the unique, meaningful and clear comments for the analysis. Lower case will halp to get same format of similar kinds of word. stopwords do not carry any meaningful significance.

Stop Words: Stop words are basically a set of commonly used words in any language, not just English.The reason why stop words are critical to many applications is that, if we remove the words that are very commonly used in a given language, we can focus on the important words instead. For example, in the context of a search engine, if your search query is “how to develop information retrieval applications”, If the search engine tries to find web pages that contained the terms “how”, “to” “develop”, “information”, ”retrieval”, “applications” the search engine is going to find a lot more pages that contain the terms “how”, “to” than pages that contain information about developing information retrieval applications because the terms “how” and “to” are so commonly used in the English language. If we disregard these two terms, the search engine can actually focus on retrieving pages that contain the keywords: “develop” “information” “retrieval” “applications” – which would bring up pages that are actually of interest.

plot histogram of ratings of our unbalanced sample¶

num_bins = 70
n, bins, patches = plt.hist(review_data3.rating, num_bins, facecolor='green', alpha=0.9)

#plt.xticks(range(9000))
plt.title('Histogram of Ratings')
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.show()

So, it is clear that our unbalced sample has similar rating pattern like orginal sample.

Now lets see top 50 common words¶

Counter(" ".join(review_data3["cleaned"]).split()).most_common(50)[:50]

[('like', 6015),
 ('fun', 5959),
 ('good', 4504),
 ('great', 3750),
 ('much', 3372),
 ('get', 3178),
 ('time', 3141),
 ('card', 2344),
 ('playing', 2308),
 ('rules', 2272),
 ('first', 2257),
 ('little', 2230),
 ('well', 2224),
 ('better', 2144),
 ('lot', 2107),
 ('dont', 2075),
 ('bit', 2008),
 ('still', 1950),
 ('love', 1941),
 ('theme', 1939),
 ('also', 1876),
 ('interesting', 1871),
 ('think', 1815),
 ('nice', 1793),
 ('2', 1757),
 ('best', 1703),
 ('even', 1652),
 ('make', 1646),
 ('easy', 1640),
 ('many', 1617),
 ('simple', 1598),
 ('im', 1553),
 ('two', 1526),
 ('dice', 1491),
 ('long', 1449),
 ('strategy', 1445),
 ('way', 1424),
 ('though', 1410),
 ('enough', 1378),
 ('different', 1375),
 ('luck', 1327),
 ('quite', 1281),
 ('see', 1272),
 ('3', 1270),
 ('pretty', 1240),
 ('rating', 1230),
 ('de', 1216),
 ('could', 1194),
 ('take', 1193),
 ('feel', 1188)]

Like, fun, good, great,much, get,time, rules, well, playing are the top 10 words repeated within the comment. Now, lets define the rating as positive (rating>8) and negative (<3) and display top 100 positive words and negative words that are very useful to predict the rating.¶

Lets see word clouds for positive words and negative words¶

from wordcloud import WordCloud
from collections import Counter

neg = review_data3.loc[review_data3['rating'] < 3]
pos = review_data3.loc[review_data3['rating'] > 8]


words = Counter([w for w in " ".join(pos['cleaned']).split()])

wc = WordCloud(width=400, height=350,colormap='plasma',background_color='white').generate_from_frequencies(dict(words.most_common(100)))
plt.figure(figsize=(20,15))
plt.imshow(wc, interpolation='bilinear')
plt.title('Common Words in Positive Reviews', fontsize=20)
plt.axis('off');
plt.show()


words = Counter([w for w in " ".join(neg['cleaned']).split()])

wc = WordCloud(width=400, height=350,colormap='plasma',background_color='white').generate_from_frequencies(dict(words.most_common(100)))
plt.figure(figsize=(20,15))
plt.imshow(wc, interpolation='bilinear')
plt.title('Common Words in Negative Reviews', fontsize=20)
plt.axis('off');
plt.show()

A word cloud of positive (rating > 8) and negative (rating < 3) reviews was generated as above. The positive word cloud contains mostly positive 100 words, the negative word cloud contains a mix of 100 words that are not necessarily negative

lets check mean, median and mode of rating of our unbalced sample¶

print('Mean: ', review_data3.rating.mean())
print('Median: ', review_data3.rating.median())
print('Mode: ', review_data3.rating.mode())

Mean:  6.856034572666648
Median:  7.0
Mode:  0    7.0
dtype: float64

Now define Necessary function to calculate RMSE,Weighted RMSE MAF and model assessment¶

def calc_rmse(errors, weights=None):
    n_errors = len(errors)
    if weights is None:
        result = sqrt(sum(error ** 2 for error in errors) / n_errors)
    else:
        result = sqrt(sum(weight * error ** 2 for weight, error in zip(weights, errors)) / sum(weights))
    return result

#if the score is far from mean (high or low scores), weight those reviews and ratings more when assessing model accuracy
def calc_weights(scores):
    peak = 6.851
    return tuple((10 ** (0.3556 * (peak - score))) if score < peak else (10 ** (0.2718 * (score - peak))) for score in scores)


def assess_model( model_name, test, predicted):
    error = test - predicted
    rmse = calc_rmse(error)
    mae = mean_absolute_error(test, predicted)
    weights = calc_weights(test)
    weighted_rmse = calc_rmse(error, weights = weights)
    
    
    print(model_name)
    print('RMSE:',rmse)
    print('Weighed RMSE:', weighted_rmse)
    print('MAE:', mae)

Lets do MNB model to predict the rating based on comments¶

The 30 thousands unbalanced data was splitted as train and test set for the modeling, then pipeline was used to tune the model.

count_vectorizer - Breaks up the text into a matrix with each word (called "token" in NLP) being the column of the matrix and the value being the count of occurences.
ngram_range - Optional parameter to extract the text in groups of 2 or more words together. This is useful because the modifiers such as 'not' can be used to change the following word's meaning.
stopwords - Removes any words from the stopwords list created in the data exploration step.
lowercase - Converts all text into lowercase.
tfidf_transformer - Weighs terms by importance to help with feature selection.
classifier - two types of models was used, multi-class classification, Multinomial NB and LinearSVC

Model performance will be judged with the accuracy value

X_train, X_test, y_train, y_test = train_test_split(review_data3.cleaned, review_data3.rating, random_state=44,test_size=0.20)

model_nb = Pipeline([
    ('count_vectorizer', CountVectorizer(lowercase = True, stop_words = stopwords.words('english'))), 
    ('tfidf_transformer',  TfidfTransformer()), #weighs terms by importance to help with feature selection
    ('classifier', MultinomialNB()) ])
    
model_nb.fit(X_train,y_train.astype('int'))
labels = model_nb.predict(X_test)
mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()

assess_model("Multinomial NB", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))

Multinomial NB
RMSE: 1.7390785011429706
Weighed RMSE: 3.8230391348432726
MAE: 1.292360885

****Test accuracy is 26.866666666666667

The accuracy of MNB model is 27 % for unbalanced sample but the problem of the above model is that it did not predict rating 1-4 and 8-9 (confusion Matrix), it has only predicted around the average value of the rating (6.85)¶

Lets do MNB model with N-grams to predict the rating based on comments¶

#Experimented with adding different numbers of n-grams, 1-2 seems to have best performance
model_nb2 = Pipeline([
    ('count_vectorizer', CountVectorizer( ngram_range=(1,2), lowercase = True, stop_words = stopwords.words('english'))), 
    ('tfidf_transformer',  TfidfTransformer()), #weighs terms by importance to help with feature selection
    ('classifier', MultinomialNB()) ])
    
model_nb2.fit(X_train,y_train.astype('int'))
labels = model_nb2.predict(X_test)
mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()

assess_model("Multinomial NB n-grams 1-2", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))

Multinomial NB n-grams 1-2
RMSE: 1.7543120303424782
Weighed RMSE: 3.871180517324755
MAE: 1.3020942183333333

****Test accuracy is 26.883333333333333

The accuracy of MNB model with N-grams is almost same of MNB (27 %) for unbalanced sample but the problem of the above model is that it also did not predict rating 1-4 and 8-9 (confusion Matrix), it has only predicted around the average value of the rating (6.85).¶

Lets check with different hyperparameter for MNB Model¶

# Convert the data in vector fpormate
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
tf_idf_train = tf_idf_vect.fit_transform(X_train)
tf_idf_test = tf_idf_vect.transform(X_test)

alpha_range = list(np.arange(0,30,1))
len(alpha_range)

30

Take different values of alpha in cross validation and finding the accuracy score¶

from sklearn.naive_bayes import MultinomialNB
y_train=y_train.astype('int')

alpha_scores=[]

for a in alpha_range:
    clf = MultinomialNB(alpha=a)
    scores = cross_val_score(clf, tf_idf_train, y_train, cv=5, scoring='accuracy')
    alpha_scores.append(scores.mean())
    print(a,scores.mean())

/opt/conda/lib/python3.7/site-packages/sklearn/naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10
  'setting alpha = %.1e' % _ALPHA_MIN)
/opt/conda/lib/python3.7/site-packages/sklearn/naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10
  'setting alpha = %.1e' % _ALPHA_MIN)
/opt/conda/lib/python3.7/site-packages/sklearn/naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10
  'setting alpha = %.1e' % _ALPHA_MIN)
/opt/conda/lib/python3.7/site-packages/sklearn/naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10
  'setting alpha = %.1e' % _ALPHA_MIN)
/opt/conda/lib/python3.7/site-packages/sklearn/naive_bayes.py:507: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10
  'setting alpha = %.1e' % _ALPHA_MIN)

0 0.2229166666666667
1 0.2659166666666667
2 0.2637083333333333
3 0.26308333333333334
4 0.26279166666666665
5 0.2628333333333333
6 0.2628749999999999
7 0.26266666666666666
8 0.26266666666666666
9 0.2625833333333333
10 0.26245833333333335
11 0.26241666666666663
12 0.2622083333333333
13 0.2622083333333333
14 0.2622083333333333
15 0.262125
16 0.262125
17 0.26216666666666666
18 0.26216666666666666
19 0.2620416666666666
20 0.262
21 0.262
22 0.2619166666666667
23 0.2619166666666667
24 0.2619166666666667
25 0.2619166666666667
26 0.2619166666666667
27 0.2619166666666667
28 0.2619166666666667
29 0.2619166666666667

# Plot b/w misclassification error and CV mean score.
import matplotlib.pyplot as plt

MSE = [1 - x for x in alpha_scores]


optimal_alpha_bnb = alpha_range[MSE.index(min(MSE))]

# plot misclassification error vs alpha
plt.plot(alpha_range, MSE)

plt.xlabel('hyperparameter alpha')
plt.ylabel('Misclassification Error')
plt.show()

optimal_alpha_bnb

1

It is found that hyper-parameter 1 performs best to predict the rating,our MNB automatically use hyperparameter 1. so already did this model.

model_nb = Pipeline([
    ('count_vectorizer', CountVectorizer(lowercase = True, stop_words = stopwords.words('english'))), 
    ('tfidf_transformer',  TfidfTransformer()), #weighs terms by importance to help with feature selection
    ('classifier', MultinomialNB(alpha=optimal_alpha_bnb)) ])
    
model_nb.fit(X_train,y_train.astype('int'))
labels = model_nb.predict(X_test)
mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()

assess_model("Multinomial NB", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))

Multinomial NB
RMSE: 1.7390785011429706
Weighed RMSE: 3.8230391348432726
MAE: 1.292360885

****Test accuracy is 26.866666666666667

We have already done this using MNB model.¶

Lets try Linear SVC Model¶

model_svc = make_pipeline(TfidfVectorizer(ngram_range=(1,3)), svm.SVC(kernel="linear",probability=True))
model_svc.fit(X_train, y_train.astype('int'))
labels = model_svc.predict(X_test)

mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()

assess_model("Linear SVC model", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))

Linear SVC model
RMSE: 1.6323990116199172
Weighed RMSE: 3.4626254301869106
MAE: 1.190857885

****Test accuracy is 29.95

SVC model performs better than any others model as Accuracy= 29% which is greater than MNB and MNB-N-grams¶

All the above models are simply predicting reviews around the average rating accept SVC model. MNB and MNB-N-grams did not predict any low reviews. This is because the training data is so unbalanced that it can't detect a negative review.¶

To overcome this situation. we need to create a balnace data set as the subset of orinal data to train the model, then we need to use that model to predict on the unbalanced sample. In this case, we will train SVC model as it performs better than other two.¶

Let us create a Balanced sample that have 2000 sample from each rating¶

review_data2.head()
rating1_subset = review_data2[review_data2['rating']==1] 
rating1_subset.head()

# Slect 100 sample that have rating =1
r1=rating1_subset.sample(2000)
r1.head()


rating2_subset = review_data2[review_data2['rating']==2] 
rating2_subset.head()
# Slect 100 sample that have rating =2
r2=rating2_subset.sample(2000)
r2.head()

rating3_subset = review_data2[review_data2['rating']==3] 
rating3_subset.head()
# Slect 100 sample that have rating =3
r3=rating3_subset.sample(2000)
r3.head()

rating4_subset = review_data2[review_data2['rating']==4] 
rating4_subset.head()
# Slect 100 sample that have rating =4
r4=rating4_subset.sample(2000)
r4.head()

rating5_subset = review_data2[review_data2['rating']==5] 
rating5_subset.head()
# Slect 100 sample that have rating =5
r5=rating5_subset.sample(2000)
r5.head()

rating6_subset = review_data2[review_data2['rating']==6] 
rating6_subset.head()
# Slect 100 sample that have rating =6
r6=rating6_subset.sample(2000)
r6.head()

rating7_subset = review_data2[review_data2['rating']==7] 
rating7_subset.head()
# Slect 100 sample that have rating =7
r7=rating7_subset.sample(2000)
r7.head()

rating8_subset = review_data2[review_data2['rating']==8] 
rating8_subset.head()
# Slect 100 sample that have rating =8
r8=rating8_subset.sample(2000)
r8.head()

rating9_subset = review_data2[review_data2['rating']==9] 
rating9_subset.head()
# Slect 100 sample that have rating=9
r9=rating9_subset.sample(2000)
r9.head()

rating10_subset = review_data2[review_data2['rating']==10] 
rating10_subset.head()
# Slect 100 sample that have rating=10
r10=rating10_subset.sample(2000)
r10.head()

Now combined all 20,000 samples with 2000 samples for each rating-Balanced Sample¶

review_balance=df = r1.append([r2, r3,r4,r5,r6,r7,r8,r9,r10])
review_balance.head()

review_balance.shape

(20000, 5)

so, our balanced sample has total 2000 rows from 200 samples for each rating. Lets clean these balance sample once again

Making lowercase, removing punctuation and stop words from Balanced sample¶

#lowercase and remove punctuation
review_balance['cleaned'] = review_balance['comment'].str.lower().apply(lambda x:''.join([i for i in x if i not in string.punctuation]))

# stopword list to use
stopwords_list = stopwords.words('english')
stopwords_list.extend(('game','play','played','players','player','people','really','board','games','one','plays','cards','would')) 

stopwords_list[-10:]

#remove stopwords
review_balance['cleaned'] = review_balance['cleaned'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords_list)]))
review_balance.head()

Now lets see the balanced rating¶

#plot histogram of ratings
num_bins = 70
n, bins, patches = plt.hist(review_balance.rating, num_bins, facecolor='green', alpha=0.9)

#plt.xticks(range(9000))
plt.title('Histogram of Ratings')
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.show()

If you see the above bar diagram, then it is clear that all rating have same number of samples which we named balanced sample. Now lets work with this balance sample. First, we will re-train our SVC model with balanced sample, then we will apply this model to predict test set from unbalanced sanple.¶

Now we are ready to re-train our SVC model with the balance data¶

X_train1, X_test1, y_train1, y_test1 = train_test_split(review_balance.cleaned, review_balance.rating, test_size=0.20)
model_svc_balance = make_pipeline(TfidfVectorizer(ngram_range=(1,3)), svm.SVC(kernel="linear",probability=True))
model_svc_balance.fit(X_train1, y_train1.astype('int'))
labels = model_svc_balance.predict(X_test1)

mat = confusion_matrix(y_test1.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()

assess_model("Linear SVC Balanced model", y_test1,labels)
acc = accuracy_score(y_test1.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy is',(acc))

Linear SVC Balanced model
RMSE: 2.6094060626893625
Weighed RMSE: 2.912383001169844
MAE: 1.8385

****Test accuracy is 23.65

** Although now it has captured all rating catagory but the accuracy is less than unbalanced sample.

Finally use the retrain SVC model to test the prediction of unbalanced data¶

X_train, X_test, y_train, y_test = train_test_split(review_data3.cleaned, review_data3.rating, test_size=0.20)
labels = model_svc_balance.predict(X_test)

mat = confusion_matrix(y_test.astype('int'), labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()

assess_model("Linear SVC model", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy of re-trained SVC is',(acc))

Linear SVC model
RMSE: 2.3992849035094483
Weighed RMSE: 2.736905640073932
MAE: 1.8000890833333336

****Test accuracy of re-trained SVC is 19.983333333333334

It seems that the accuracy has decreased after using balnced data as the training. Although it has captured all rating class but the error rate is high as we have got accuracy only 29% while it was 30% from SVC model in case of unbalanced data but it was biased results.¶

Lets try us with Ensemble model to see the performance.¶

Create ensemble model using VotingClassifier¶

X_train, X_test, y_train, y_test = train_test_split(review_data3.cleaned, review_data3.rating, test_size=0.20)

Ensemble = VotingClassifier(estimators=[('model_svc_unbalance',model_svc), ('model_svc_balance', model_svc_balance )],
                        voting='soft',
                        weights=[3, 1])

Ensemble.fit(X_train,y_train.astype(int))


labels = Ensemble.predict(X_test)
mat = confusion_matrix(y_test.astype(int), labels)
ax = sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()
assess_model("Ensemble model", y_test,labels)
acc = accuracy_score(y_test.astype('int'),labels, normalize=True) * float(100)
print('\n****Test accuracy of Ensemble SVC is',(acc))

Ensemble model
RMSE: 1.665938821715081
Weighed RMSE: 3.013509632025085
MAE: 1.2075596700000002

****Test accuracy of Ensemble SVC is 28.833333333333332

Ensemble model with voting classifier shows only 29% accuracy which is still same with our SVC-unbalanced model (29%).¶

#join results of SVC models on balanced and unbalanced data to create ensemble model¶

X_train, X_test, y_train, y_test = train_test_split(review_data3.cleaned, review_data3.rating, test_size=0.20)

labels = model_svc.predict(X_test)
labels_2 = model_svc_balance.predict(X_test)


pred = pd.concat([pd.DataFrame(y_test).reset_index().rating,pd.Series(labels),pd.Series(labels_2)],axis=1)
pred.columns = ['rating','model_1','model_2']

pred = pd.concat([pd.DataFrame(y_test).reset_index().rating,pd.Series(labels),pd.Series(labels_2)],axis=1)
pred.columns = ['rating','model_1','model_2']

pred['final'] = np.where(pred.model_2 >= 3, np.where(pred.model_2 <= 9, pred.model_1, pred.model_2), pred.model_2)
#pred['final'] = np.where(pred.model_2 <= 9, pred.model_1, pred.model_2)
pred.tail()

mat = confusion_matrix(pred.rating.astype(int), pred.final)
ax = sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False )
plt.xlabel('true label')
plt.ylabel('predicted label');
plt.show()
assess_model("Ensemble model", pred.rating,pred.final)

acc = accuracy_score(pred.rating.astype(int),pred.final, normalize=True) * float(100)
print('\n****Test accuracy of Ensemble SVC is',(acc))

Ensemble model
RMSE: 1.7523654870036909
Weighed RMSE: 2.3686473512013473
MAE: 0.8857402483333333

****Test accuracy of Ensemble SVC is 66.95

WOOOW! Finally we have got 66% accuracy which is best than any others model we discussed here. It has also captured all kinds of rating from our data set rather than capturing rating close the mean.¶

Summary¶

Models Name and Accuracy:¶

Multinomial Naïve Bayes(MNB) for Unbalanced dataset 27%¶

MNB with N-grams for Unbalanced dataset 27%¶

SVC-Linear for Unbalanced dataset 29%¶

SVC-Linear Re-trained with Balanced Dataset 25%¶

SVC-Linear Re-trained on Unbalanced dataset 20%¶

Ensemble Model with Voting Classifier 29%¶

Ensemble Model with Balanced and Unbalanced SVC 66%¶

To obtain the best accuracy MNB, MNB with N-grams, linear-SVC, Ensemble models was conducted. The accuracy for different models in case of unbalaced data set was: MNB accuaracy 27% with hyperparameter 1, MNB with N-grams 27% and SVC accuracy was 29%. But due to unbalaced data, all these models captured prediction around the mean, it was not able to capture most of the negative rating (<5) and very high positive rating (>8). SVC models performs better than MNB and MNB-N-grams and it captured both low and high rating. To overcome this problem a 20,000 balanced sample was created by taking 2000 samples from each rating. Then the SVC model was trained with balanced sample, after that SVC model was used to predict on unbalanced dataset. It was found that balanced training model accuracy was less than unbalced models, but it was able to capture all kinds of rating. SVC Balanced accuracy was 25% and SVC-unbalanced test accuracy was 20%, here we train and test SVC models only as we got previously that SVC performs better than MNB and MNB-N-grams in terms of best accuracy of the prediction. Voting Ensemable shows accuacy 29% while Joining Ensemble shows 66% accuracy which is outstanding results for this project. SVC balnced and SVC unbalced model was joined in this best model to predict the rating on unbalanced dataset. This study concludes that Ensemble model performs best than any other models with highest accuarcy.¶

Challenges and Improvements¶

The challanging of this project was handling big data,selecting the sample size and finding the best machine leaning algorithm and impliment them in a proper way to make the model best performed. I tried with different sample sizes, accuracy varied with the sample size. I have taken the sample size that can be handled with kaggle server and minimize run time, at the same time, shows good accuaracy rate. One of the another challanges was finding the good accuarcy from the model. This projects has given overview of performance of different models in terms of accuracy over the existing references.¶

	user	rating	comment	ID	name
0	sidehacker	10.0	NaN	13	Catan
1	Varthlokkur	10.0	NaN	13	Catan
2	dougthonus	10.0	Currently, this sits on my list as my favorite...	13	Catan
3	cypar7	10.0	I know it says how many plays, but many, many ...	13	Catan
4	ssmooth	10.0	NaN	13	Catan

	user	rating	comment	ID	name
2	dougthonus	10.0	Currently, this sits on my list as my favorite...	13	Catan
3	cypar7	10.0	I know it says how many plays, but many, many ...	13	Catan
7	hreimer	10.0	i will never tire of this game.. Awesome	13	Catan
11	daredevil	10.0	This is probably the best game I ever played. ...	13	Catan
16	hurkle	10.0	Fantastic game. Got me hooked on games all ove...	13	Catan

	rating	ID
count	2.637755e+06	2.637755e+06
mean	6.852071e+00	6.693992e+04
std	1.775769e+00	7.304448e+04
min	1.401300e-45	1.000000e+00
25%	6.000000e+00	3.955000e+03
50%	7.000000e+00	3.126000e+04
75%	8.000000e+00	1.296220e+05
max	1.000000e+01	2.724090e+05

	user	rating	comment	ID	name
9058383	feldfan2014	8.0	Replaced with English version	90040	Pergamon
3135010	jimmyhudson	4.0	Theme isn't that interesting to me so I know s...	100901	Flash Point: Fire Rescue
300	Schwarzie2478	8.0	Everything plays realy smooth, it's good that ...	122842	Exodus: Proxima Centauri
2101261	dipplestix	9.0	This game is excellent (with all three expansi...	2655	Hive
329336	fateswanderer	9.0	Speedy, casual game with a fantastic mechanic ...	36218	Dominion

	user	rating	comment	ID	name	word_count	cleaned
9058383	feldfan2014	8.0	Replaced with English version	90040	Pergamon	29	replaced english version
3135010	jimmyhudson	4.0	Theme isn't that interesting to me so I know s...	100901	Flash Point: Fire Rescue	95	theme isnt interesting know someone else proba...
300	Schwarzie2478	8.0	Everything plays realy smooth, it's good that ...	122842	Exodus: Proxima Centauri	129	everything realy smooth good turn simultanuous...
2101261	dipplestix	9.0	This game is excellent (with all three expansi...	2655	Hive	455	excellent three expansions base lbm makes adva...
329336	fateswanderer	9.0	Speedy, casual game with a fantastic mechanic ...	36218	Dominion	957	speedy casual fantastic mechanic allows massiv...

	user	rating	comment	ID	name
4274325	Misiodziej	10.0	Bw:5	127023	Kemet
2507106	boulette de steak	10.0	It's my favorite game !!! Without doubt the be...	42	Tigris & Euphrates
8835445	Kejben	10.0	Great two player strategy game with a lot of l...	82421	Summoner Wars: Phoenix Elves vs Tundra Orcs
3430448	Koert	10.0	Excellent epic complex card-driven civilisatio...	25613	Through the Ages: A Story of Civilization
5067212	Hummingbirdmagic	10.0	I have a great interest in history, especially...	171668	The Grizzled

	user	rating	comment	ID	name
681566	Numskull	1.0	A twelve hour war game disguised as a civiliza...	3870	7 Ages
462726	mgringo	1.0	N/C. not worth it.	16398	War
6491275	newkillerstar27	1.0	Far too long and far too random.	24310	The Red Dragon Inn
9558637	LosSchabossDragon	1.0	The whole theme is distasteful. Ok a lot of em...	65282	Tanto Cuore
2412076	lortelars	1.0	An incredibly boring game with "roll to move" ...	1406	Monopoly

	rating	model_1	model_2	final
5995	7.0	7	4	7
5996	7.0	7	5	7
5997	6.5	6	7	6
5998	7.5	6	7	6
5999	7.0	7	6	7